1 Introduction

2 Purpose

[…]

The functions to be executed in the local side (i.e. data analyst machine) will extremely facilitate the bioinformatic analyses by running simple R sentences for each of the required genomic data analysis. These functions are implemented in the dsOmicsClient package that includes:

  • Principal component analyses (population stratification)
  • Allele frequency
  • Hardy-Weinberg equillibrium test
  • Single association analyses (meta-analysis and mega-analysis, e.g pooled data)
  • GWAS (meta-analysis)
  • Polygenic risk score (PRS)

The package also allows to perform:

  • Epigenome-wide association analyses
  • Meta-analyses
  • Post-omic data analyses (e.g. pathway/enrichment)

3 Use cases

Three different use cases will be illustrated along this document; the first one, will show a single cohort example that has it’s variant calling data (VCF) files separated by chromosome; the second one, will show a multi-cohort (two cohorts for simplicity) scenario also with files separated by chromosome; the third one, will illustrate the calculation of polygenic risk scores for the individuals. It’s important noting that both use cases can be run by using a single VCF files that contains all the variants, it is just not advisable from a performance point of view, internally we are converting the files to GDS data containers, and by doing that, R tries to allocate the whole files to the RAM, it is more manageable to use single chromosome VCF files for that reason, nevertheless if the Opal servers in use are powerful enough that is not a limitation.

3.1 Single cohort

The single cohort analysis is a way of performing a GWAS study guaranteeing GDPR data confidentiality. The structure followed is illustrated on the following figure.

Proposed infrastructure to perform single-cohort GWAS studies.

(#fig:single_cohort_image)Proposed infrastructure to perform single-cohort GWAS studies.

The data analyst corresponds to the “RStudio” session, which through DataSHIELD Interface (DSI) connects with the Opal server located at the cohort network. This Opal server contains an array of resources that correspond to the different VCF files (sliced by chromosome)1 The same methodology and code can be used with unitary VCF resources that contain the variant information of all chromosomes. and a resource that corresponds to the phenotypes table of the studied individuals. This resources are URL pointers to the Data store of the Opal server, so that the data is contained through all the analysis inside the cohort network, and the only data being transfered are non-disclosive statistics.

3.1.1 Connection to the Opal server

We have to create an Opal connection object to the cohort server. We do that using the following DSI functions.

require('DSI')
require('DSOpal')
require('dsBaseClient')
require('dsOmicsClient')

builder <- DSI::newDSLoginBuilder()
builder$append(server = "cohort1", url = "https://opal-demo.obiba.org/",
               user =  "dsuser", password = "password",
               driver = "OpalDriver", profile = "omics")
logindata <- builder$build()
conns <- DSI::datashield.login(logins = logindata)

3.1.2 Assign the VCF resources

Now that we have created a connection object to the Opal, we have triggered a new R session on the server. Once this R session is active, we can begin assigning resources to it.

DSI::datashield.assign.resource(conns, "chr19", "GWAS.chr19A")
DSI::datashield.assign.resource(conns, "chr21", "GWAS.chr21A")

This resources that we have assigned are pointers to the VCF files that we want to use. To access the data and load it into the server, we have to resolve the resources.

DSI::datashield.assign.expr(conns = conns, symbol = "gds19_object", 
                            expr = quote(as.resource.object(chr19)))
DSI::datashield.assign.expr(conns = conns, symbol = "gds21_object", 
                            expr = quote(as.resource.object(chr21)))

We can check he assignment was successful by checking the class of the new objects.

ds.class("chr19")
## $cohort1
## [1] "GDSFileResourceClient" "FileResourceClient"    "ResourceClient"       
## [4] "R6"
ds.class("gds19_object")
## $cohort1
## [1] "GdsGenotypeReader"
## attr(,"package")
## [1] "GWASTools"

3.1.3 Assign the covariates resource

On this use case the covariates table is a resource, please note that it could be a regular Opal table. Since it is a resource we follow the same procedure as with the VCF resources. The only change is that we have to specify that we are resolving a resource into a table instead of into an object, so in this case we will use as.resource.data.frame.

DSI::datashield.assign.resource(conns, "feno", "GWAS.feno")
DSI::datashield.assign.expr(conns = conns, symbol = "feno_object",
                            expr = quote(as.resource.data.frame(feno)))

If the covariates were contained on a regular Opal table, the assignment procedure would be the following.

DSI::datashield.assign.table(conns, "feno_object", "project.covar_table")

When we assign a table, we don’t have to resolve it as with the resources.

3.1.4 Create the Genotype Data objects

Now that we have both the GDS (container for VCF files) and covariates loaded into the R session, we have to merge them into a type of container called GenotypeData. Some things have to be considered before performing this merge:

  • Which column has the samples identifier? (mandatory to have it on the covariates file)
  • Which is the sex column on the covariates file? (mandatory to have it on the covariates file)
  • How are males and how are females encoded into this column?
  • Which is the case-control column of interest of the covariates?
  • What is the encoding for case and for control on this column?

It’s important to note that all the other encodings present on the case-control column that are not the case or control, will be turned into NAs.

We can use DataSHIELD to answer the aforementioned questions:

Getting the ID, sex and case-control columns:

ds.colnames("feno_object")
## $cohort1
##  [1] "Age at cancer diagnosis"                                 
##  [2] "Age at recruitment"                                      
##  [3] "Amount of alcohol drunk on a typical drinking day"       
##  [4] "Basophill count"                                         
##  [5] "Behaviour of cancer tumour"                              
##  [6] "Body mass index (BMI)"                                   
##  [7] "C-reactive protein reportability"                        
##  [8] "Cancer record origin"                                    
##  [9] "Cancer report format"                                    
## [10] "Cholesterol"                                             
## [11] "Country of birth (UK/elsewhere)"                         
## [12] "Current tobacco smoking"                                 
## [13] "Date of cancer diagnosis"                                
## [14] "Diabetes diagnosed by doctor"                            
## [15] "Diastolic blood pressure, automated reading"             
## [16] "Duration of moderate activity"                           
## [17] "Duration of vigorous activity"                           
## [18] "Energy"                                                  
## [19] "Eosinophill count"                                       
## [20] "Ethnic background"                                       
## [21] "Ever smoked"                                             
## [22] "Fasting time"                                            
## [23] "Frequency of drinking alcohol"                           
## [24] "HDL cholesterol"                                         
## [25] "Hip circumference"                                       
## [26] "Histology of cancer tumour"                              
## [27] "Home location - east co-ordinate"                        
## [28] "Home location - north co-ordinate"                       
## [29] "Housing score (England)"                                 
## [30] "Housing score (Scotland)"                                
## [31] "Housing score (Wales)"                                   
## [32] "Income score (England)"                                  
## [33] "Income score (Wales)"                                    
## [34] "LDL direct"                                              
## [35] "Lymphocyte count"                                        
## [36] "Monocyte count"                                          
## [37] "Neutrophill count"                                       
## [38] "Non-accidental death in close genetic family"            
## [39] "Number in household"                                     
## [40] "Number of children fathered"                             
## [41] "Number of operations, self-reported"                     
## [42] "Operative procedures - OPCS4"                            
## [43] "Past tobacco smoking"                                    
## [44] "Platelet count"                                          
## [45] "Pulse rate, automated reading"                           
## [46] "Qualifications"                                          
## [47] "Reported occurrences of cancer"                          
## [48] "Reticulocyte count"                                      
## [49] "Sleep duration"                                          
## [50] "Standing height"                                         
## [51] "subject_id"                                              
## [52] "Triglycerides"                                           
## [53] "Type of cancer: ICD10"                                   
## [54] "Type of cancer: ICD10 addendum"                          
## [55] "Type of cancer: ICD9"                                    
## [56] "Weight"                                                  
## [57] "White blood cell (leukocyte) count"                      
## [58] "alias"                                                   
## [59] "description"                                             
## [60] "organism"                                                
## [61] "project"                                                 
## [62] "sex"                                                     
## [63] "phenotype"                                               
## [64] "BioSample_name"                                          
## [65] "ENA-CHECKLIST"                                           
## [66] "Age at death"                                            
## [67] "Age diabetes diagnosed"                                  
## [68] "Date K85 first reported (acute pancreatitis)"            
## [69] "Date K86 first reported (other diseases of pancreas)"    
## [70] "Date of death"                                           
## [71] "Endocrine, nutritional and metabolic diseases"           
## [72] "Source of report of K85 (acute pancreatitis)"            
## [73] "Source of report of K86 (other diseases of pancreas)"    
## [74] "Age high blood pressure diagnosed"                       
## [75] "Mental and behavioural disorders"                        
## [76] "Respiratory system disorders"                            
## [77] "Cancer year/age first occurred"                          
## [78] "Circulatory system disorders"                            
## [79] "Digestive system disorders"                              
## [80] "Nervous system disorders"                                
## [81] "Blood, blood-forming organs and certain immune disorders"
  • ID col: Number 51
  • Sex col: Number 62
  • case-control col: 38

Getting the encodings of the sex and case-control columns

ds.table("feno_object$sex")
## 
##  Data in all studies were valid 
## 
## Study 1 :  No errors reported from this study
## $output.list
## $output.list$TABLE_rvar.by.study_row.props
##                study
## feno_object$sex cohort1
##          female       1
##          male         1
## 
## $output.list$TABLE_rvar.by.study_col.props
##                study
## feno_object$sex   cohort1
##          female 0.5075879
##          male   0.4924121
## 
## $output.list$TABLE_rvar.by.study_counts
##                study
## feno_object$sex cohort1
##          female    1271
##          male      1233
## 
## $output.list$TABLES.COMBINED_all.sources_proportions
## feno_object$sex
## female   male 
##  0.508  0.492 
## 
## $output.list$TABLES.COMBINED_all.sources_counts
## feno_object$sex
## female   male 
##   1271   1233 
## 
## 
## $validity.message
## [1] "Data in all studies were valid"
# ds.table("feno_object$`Non-accidental death in close genetic family`")
  • Male encoding: male
  • Female encoding: female
  • Case encoding: Yes
  • Control encoding: No

With all this information we can now merge the covariates and GDS data to a GenotypeData object. We have to do it for every chromosome keeping the same covariates file for all of them.

ds.GenotypeData(x='gds19_object', covars = 'feno_object', columnId = 51, sexId = 62, 
                male_encoding = "male", female_encoding = "female",
                case_control_column = 38, case = "Yes", control = "No",
                newobj.name = 'gds.Data19', datasources = conns)
ds.GenotypeData(x='gds21_object', covars = 'feno_object', columnId = 51, sexId = 62, 
                male_encoding = "male", female_encoding = "female",
                case_control_column = 38, case = "Yes", control = "No",
                newobj.name = 'gds.Data21', datasources = conns)

Now we are ready to perform a GWAS analysis. It is important noting that we have to pass a formula object to the ds.GWAS function. The variables from the covariates file may contain some special characters and spaces that are not allowed by the R formula element, to account for that we should use th make.names function. As an example, we may want to study the following:

Non-accidental death in close genetic family ~ sex + HDL cholesterol

We pass every variable name through make.names.

make.names("Non-accidental death in close genetic family")
## [1] "Non.accidental.death.in.close.genetic.family"
make.names("sex")
## [1] "sex"
make.names("HDL cholesterol")
## [1] "HDL.cholesterol"

And we construct or final formula with the outputs.

Non.accidental.death.in.close.genetic.family ~ sex + HDL.cholesterol

3.1.5 Performing the GWAS

We now proceed to perform the GWAS analysis.

results <- ds.GWAS(genoData = c("gds.Data19", "gds.Data21"), model = Non.accidental.death.in.close.genetic.family ~ sex + HDL.cholesterol)[[1]]

DT::datatable(results, class = 'cell-border stripe', width = '90%', options=list(columnDefs = list(list(visible=FALSE, targets=0)))) %>% DT::formatRound(columns = c(5:8), digits = 3)

We can display the results of the GWAS using a Manhattan plot.

manhattan(results)

datashield.logout(conns)

3.2 Multi cohorts

A GWAS can be performed on a multi-cohort situation to get results for all the cohorts to be analyzed using meta-analysis techniques. As with the single-cohort methodology illustrated on the prior section, this method guarantees GDPR data confidentiality. The structure for a two cohort study is illustrated on the following figure (note this can be extrapolated for cohorts with a bigger partner count).

Proposed infrastructure to perform multi-cohort GWAS studies.

(#fig:multi_cohort_image)Proposed infrastructure to perform multi-cohort GWAS studies.

The data analyst corresponds to the “RStudio” session, which through DataSHIELD Interface (DSI) connects with all the different Opal servers located at the cohorts network (one Opal per cohort). This Opal servers contain an array of resources that correspond to the different VCF files (sliced by chromosome)2 The same methodology and code can be used with unitary VCF resources that contain the variant information of all chromosomes. and a resource that corresponds to the phenotypes table of the studied individuals. This resources are URL pointers to the Data store of each Opal server, so that the data is contained through all the analysis inside the cohort network, and the only data being transfered are non-disclosive statistics.

3.2.1 Connection to the Opal server

We have to create an Opal connection object to the different cohorts servers. We do that using the following DSI functions.

require('DSI')
require('DSOpal')
require('dsBaseClient')
require('dsOmicsClient')

builder <- DSI::newDSLoginBuilder()
builder$append(server = "cohort1", url = "https://opal-demo.obiba.org/",
               user =  "dsuser", password = "password",
               driver = "OpalDriver", profile = "omics")
builder$append(server = "cohort2", url = "https://opal-demo.obiba.org/",
               user =  "dsuser", password = "password",
               driver = "OpalDriver", profile = "omics")
logindata <- builder$build()
conns <- DSI::datashield.login(logins = logindata)

It is important to note that in this illustrated example, we are using only one server that contains all the resources (https://opal-demo.obiba.org/), on this server there are all the resources that correspond to the different cohorts. From a technical point of view we are making two different connections to the server to simulate this multi-cohort scenario. A more realistic code would be the following one (fake URLs, don’t try to reproduce it).

require('DSI')
require('DSOpal')
require('dsBaseClient')
require('dsOmicsClient')

builder <- DSI::newDSLoginBuilder()
builder$append(server = "cohort1", url = "https://opal.cohort1.org/",
               user =  "dsuser_cohort_1", password = "password_cohort_1",
               driver = "OpalDriver")
builder$append(server = "cohort2", url = "https://opal.cohort2.org/",
               user =  "dsuser_cohort_2", password = "password_cohort_2",
               driver = "OpalDriver")
logindata <- builder$build()
conns <- DSI::datashield.login(logins = logindata)

3.2.2 Assign the VCF resources

Now that we have created a connection object to the different Opals, we have triggered a new R session on all the servers. Once this R sessions are active, we can begin assigning resources to them. To do so we have to specify which connection object we are talking with, we can take a look at them beforehand.

conns
## $cohort1
## An object of class "OpalConnection"
## Slot "name":
## [1] "cohort1"
## 
## Slot "opal":
## url: https://opal-demo.obiba.org 
## name: cohort1 
## version: 4.2.1 
## username: dsuser 
## profile: omics 
## 
## 
## $cohort2
## An object of class "OpalConnection"
## Slot "name":
## [1] "cohort2"
## 
## Slot "opal":
## url: https://opal-demo.obiba.org 
## name: cohort2 
## version: 4.2.1 
## username: dsuser 
## profile: omics

We see that the first connection object is the cohort1 and the second object is the cohort2.

# Cohort 1 resources
DSI::datashield.assign.resource(conns[1], "chr19", "GWAS.chr19A")
DSI::datashield.assign.resource(conns[1], "chr21", "GWAS.chr21A")

# Cohort 2 resources
DSI::datashield.assign.resource(conns[2], "chr19", "GWAS.chr19B")
DSI::datashield.assign.resource(conns[2], "chr21", "GWAS.chr21B")

We can check that we did the assignments properly by checking the workspaces of all the connections.

ds.ls(datasources = conns)
## $cohort1
## $cohort1$environment.searched
## [1] "R_GlobalEnv"
## 
## $cohort1$objects.found
## [1] "chr19" "chr21"
## 
## 
## $cohort2
## $cohort2$environment.searched
## [1] "R_GlobalEnv"
## 
## $cohort2$objects.found
## [1] "chr19" "chr21"

This resources that we have assigned are pointers to the VCF files that we want to use. To access the data and load it into the server, we have to resolve the resources. Since on all the cohort servers we assigned the resources to the same variables (chr19 and chr21), we don’t have to manipulate them separately as we did on the previous code chunk.

DSI::datashield.assign.expr(conns = conns, symbol = "gds19_object", 
                            expr = quote(as.resource.object(chr19)))
DSI::datashield.assign.expr(conns = conns, symbol = "gds21_object", 
                            expr = quote(as.resource.object(chr21)))

We can check he assignment was successful by checking the class of the new objects.

ds.class("chr19")
## $cohort1
## [1] "GDSFileResourceClient" "FileResourceClient"    "ResourceClient"       
## [4] "R6"                   
## 
## $cohort2
## [1] "GDSFileResourceClient" "FileResourceClient"    "ResourceClient"       
## [4] "R6"
ds.class("gds19_object")
## $cohort1
## [1] "GdsGenotypeReader"
## attr(,"package")
## [1] "GWASTools"
## 
## $cohort2
## [1] "GdsGenotypeReader"
## attr(,"package")
## [1] "GWASTools"

3.2.3 Assign the covariates resource

On this use case the covariates table is a resource, please note that it could be a regular Opal table. Since it is a resource we follow the same procedure as with the VCF resources. The only change is that we have to specify that we are resolving a resource into a table instead of into an object, so in this case we will use as.resource.data.frame. It is important to take into account that this covariates table has to be harmonized between all the cohorts, that means having the same structure and column order, in the case this condition is not met, we can still perform the analysis, we just have to be careful on the the following section (creating the Genotype Data objects) to send different parameters to each server in a similar fashion as the one used on the Assign the VCF resources section.

DSI::datashield.assign.resource(conns, "feno", "GWAS.feno")
DSI::datashield.assign.expr(conns = conns, symbol = "feno_object",
                            expr = quote(as.resource.data.frame(feno)))

If the covariates were contained on a regular Opal table, the assignment procedure would be the following.

DSI::datashield.assign.table(conns, "feno_object", "project.covar_table")

When we assign a table, we don’t have to resolve it as with the resources.

3.2.4 Create the Genotype Data objects

Now that we have both the GDS (container for VCF files) and covariates loaded into the R sessions, we have to merge them into a type of container called GenotypeData. Some things have to be considered before performing this merge:

  • Which column has the samples identifier? (mandatory to have it on the covariates file)
  • Which is the sex column on the covariates file? (mandatory to have it on the covariates file)
  • How are males and how are females encoded into this column?
  • Which is the case-control column of interest of the covariates?
  • What is the encoding for case and for control on this column?

As mentioned before, for simplicity, the covariates table has to be harmonized between all the cohorts, otherwise we will have to pass each R session different arguments corresponding to this questions we will answer.

It’s important to note that all the other encodings present on the case-control column that are not the case or control, will be turned into NAs.

We can use DataSHIELD to answer the aforementioned questions:

Getting the ID, sex and case-control columns:

ds.colnames("feno_object")
## $cohort1
##  [1] "Age at cancer diagnosis"                                 
##  [2] "Age at recruitment"                                      
##  [3] "Amount of alcohol drunk on a typical drinking day"       
##  [4] "Basophill count"                                         
##  [5] "Behaviour of cancer tumour"                              
##  [6] "Body mass index (BMI)"                                   
##  [7] "C-reactive protein reportability"                        
##  [8] "Cancer record origin"                                    
##  [9] "Cancer report format"                                    
## [10] "Cholesterol"                                             
## [11] "Country of birth (UK/elsewhere)"                         
## [12] "Current tobacco smoking"                                 
## [13] "Date of cancer diagnosis"                                
## [14] "Diabetes diagnosed by doctor"                            
## [15] "Diastolic blood pressure, automated reading"             
## [16] "Duration of moderate activity"                           
## [17] "Duration of vigorous activity"                           
## [18] "Energy"                                                  
## [19] "Eosinophill count"                                       
## [20] "Ethnic background"                                       
## [21] "Ever smoked"                                             
## [22] "Fasting time"                                            
## [23] "Frequency of drinking alcohol"                           
## [24] "HDL cholesterol"                                         
## [25] "Hip circumference"                                       
## [26] "Histology of cancer tumour"                              
## [27] "Home location - east co-ordinate"                        
## [28] "Home location - north co-ordinate"                       
## [29] "Housing score (England)"                                 
## [30] "Housing score (Scotland)"                                
## [31] "Housing score (Wales)"                                   
## [32] "Income score (England)"                                  
## [33] "Income score (Wales)"                                    
## [34] "LDL direct"                                              
## [35] "Lymphocyte count"                                        
## [36] "Monocyte count"                                          
## [37] "Neutrophill count"                                       
## [38] "Non-accidental death in close genetic family"            
## [39] "Number in household"                                     
## [40] "Number of children fathered"                             
## [41] "Number of operations, self-reported"                     
## [42] "Operative procedures - OPCS4"                            
## [43] "Past tobacco smoking"                                    
## [44] "Platelet count"                                          
## [45] "Pulse rate, automated reading"                           
## [46] "Qualifications"                                          
## [47] "Reported occurrences of cancer"                          
## [48] "Reticulocyte count"                                      
## [49] "Sleep duration"                                          
## [50] "Standing height"                                         
## [51] "subject_id"                                              
## [52] "Triglycerides"                                           
## [53] "Type of cancer: ICD10"                                   
## [54] "Type of cancer: ICD10 addendum"                          
## [55] "Type of cancer: ICD9"                                    
## [56] "Weight"                                                  
## [57] "White blood cell (leukocyte) count"                      
## [58] "alias"                                                   
## [59] "description"                                             
## [60] "organism"                                                
## [61] "project"                                                 
## [62] "sex"                                                     
## [63] "phenotype"                                               
## [64] "BioSample_name"                                          
## [65] "ENA-CHECKLIST"                                           
## [66] "Age at death"                                            
## [67] "Age diabetes diagnosed"                                  
## [68] "Date K85 first reported (acute pancreatitis)"            
## [69] "Date K86 first reported (other diseases of pancreas)"    
## [70] "Date of death"                                           
## [71] "Endocrine, nutritional and metabolic diseases"           
## [72] "Source of report of K85 (acute pancreatitis)"            
## [73] "Source of report of K86 (other diseases of pancreas)"    
## [74] "Age high blood pressure diagnosed"                       
## [75] "Mental and behavioural disorders"                        
## [76] "Respiratory system disorders"                            
## [77] "Cancer year/age first occurred"                          
## [78] "Circulatory system disorders"                            
## [79] "Digestive system disorders"                              
## [80] "Nervous system disorders"                                
## [81] "Blood, blood-forming organs and certain immune disorders"
## 
## $cohort2
##  [1] "Age at cancer diagnosis"                                 
##  [2] "Age at recruitment"                                      
##  [3] "Amount of alcohol drunk on a typical drinking day"       
##  [4] "Basophill count"                                         
##  [5] "Behaviour of cancer tumour"                              
##  [6] "Body mass index (BMI)"                                   
##  [7] "C-reactive protein reportability"                        
##  [8] "Cancer record origin"                                    
##  [9] "Cancer report format"                                    
## [10] "Cholesterol"                                             
## [11] "Country of birth (UK/elsewhere)"                         
## [12] "Current tobacco smoking"                                 
## [13] "Date of cancer diagnosis"                                
## [14] "Diabetes diagnosed by doctor"                            
## [15] "Diastolic blood pressure, automated reading"             
## [16] "Duration of moderate activity"                           
## [17] "Duration of vigorous activity"                           
## [18] "Energy"                                                  
## [19] "Eosinophill count"                                       
## [20] "Ethnic background"                                       
## [21] "Ever smoked"                                             
## [22] "Fasting time"                                            
## [23] "Frequency of drinking alcohol"                           
## [24] "HDL cholesterol"                                         
## [25] "Hip circumference"                                       
## [26] "Histology of cancer tumour"                              
## [27] "Home location - east co-ordinate"                        
## [28] "Home location - north co-ordinate"                       
## [29] "Housing score (England)"                                 
## [30] "Housing score (Scotland)"                                
## [31] "Housing score (Wales)"                                   
## [32] "Income score (England)"                                  
## [33] "Income score (Wales)"                                    
## [34] "LDL direct"                                              
## [35] "Lymphocyte count"                                        
## [36] "Monocyte count"                                          
## [37] "Neutrophill count"                                       
## [38] "Non-accidental death in close genetic family"            
## [39] "Number in household"                                     
## [40] "Number of children fathered"                             
## [41] "Number of operations, self-reported"                     
## [42] "Operative procedures - OPCS4"                            
## [43] "Past tobacco smoking"                                    
## [44] "Platelet count"                                          
## [45] "Pulse rate, automated reading"                           
## [46] "Qualifications"                                          
## [47] "Reported occurrences of cancer"                          
## [48] "Reticulocyte count"                                      
## [49] "Sleep duration"                                          
## [50] "Standing height"                                         
## [51] "subject_id"                                              
## [52] "Triglycerides"                                           
## [53] "Type of cancer: ICD10"                                   
## [54] "Type of cancer: ICD10 addendum"                          
## [55] "Type of cancer: ICD9"                                    
## [56] "Weight"                                                  
## [57] "White blood cell (leukocyte) count"                      
## [58] "alias"                                                   
## [59] "description"                                             
## [60] "organism"                                                
## [61] "project"                                                 
## [62] "sex"                                                     
## [63] "phenotype"                                               
## [64] "BioSample_name"                                          
## [65] "ENA-CHECKLIST"                                           
## [66] "Age at death"                                            
## [67] "Age diabetes diagnosed"                                  
## [68] "Date K85 first reported (acute pancreatitis)"            
## [69] "Date K86 first reported (other diseases of pancreas)"    
## [70] "Date of death"                                           
## [71] "Endocrine, nutritional and metabolic diseases"           
## [72] "Source of report of K85 (acute pancreatitis)"            
## [73] "Source of report of K86 (other diseases of pancreas)"    
## [74] "Age high blood pressure diagnosed"                       
## [75] "Mental and behavioural disorders"                        
## [76] "Respiratory system disorders"                            
## [77] "Cancer year/age first occurred"                          
## [78] "Circulatory system disorders"                            
## [79] "Digestive system disorders"                              
## [80] "Nervous system disorders"                                
## [81] "Blood, blood-forming organs and certain immune disorders"
  • ID col: Number 51
  • Sex col: Number 62
  • case-control col: 38

Getting the encodings of the sex and case-control columns

ds.table("feno_object$sex")
## 
##  Data in all studies were valid 
## 
## Study 1 :  No errors reported from this study
## Study 2 :  No errors reported from this study
## $output.list
## $output.list$TABLE_rvar.by.study_row.props
##                study
## feno_object$sex cohort1 cohort2
##          female     0.5     0.5
##          male       0.5     0.5
## 
## $output.list$TABLE_rvar.by.study_col.props
##                study
## feno_object$sex   cohort1   cohort2
##          female 0.5075879 0.5075879
##          male   0.4924121 0.4924121
## 
## $output.list$TABLE_rvar.by.study_counts
##                study
## feno_object$sex cohort1 cohort2
##          female    1271    1271
##          male      1233    1233
## 
## $output.list$TABLES.COMBINED_all.sources_proportions
## feno_object$sex
## female   male 
##  0.508  0.492 
## 
## $output.list$TABLES.COMBINED_all.sources_counts
## feno_object$sex
## female   male 
##   2542   2466 
## 
## 
## $validity.message
## [1] "Data in all studies were valid"
# ds.table("feno_object$`Non-accidental death in close genetic family`")
  • Male encoding: male
  • Female encoding: female
  • Case encoding: Yes
  • Control encoding: No

With all this information we can now merge the covariates and GDS data to a GenotypeData object. We have to do it for every chromosome keeping the same covariates file for all of them.

ds.GenotypeData(x='gds19_object', covars = 'feno_object', columnId = 51, sexId = 62,
                male_encoding = "male", female_encoding = "female",
                case_control_column = 38, case = "Yes", control = "No",
                newobj.name = 'gds.Data19', datasources = conns)
ds.GenotypeData(x='gds21_object', covars = 'feno_object', columnId = 51, sexId = 62,
                male_encoding = "male", female_encoding = "female",
                case_control_column = 38, case = "Yes", control = "No",
                newobj.name = 'gds.Data21', datasources = conns)

Now we are ready to perform a GWAS analysis. It is important noting that we have to pass a formula object to the ds.GWAS function. The variables from the covariates file may contain some special characters and spaces that are not allowed by the R formula element, to account for that we should use th make.names function. As an example, we may want to study the following:

Non-accidental death in close genetic family ~ sex + HDL cholesterol

We pass every variable name through make.names.

make.names("Non-accidental death in close genetic family")
## [1] "Non.accidental.death.in.close.genetic.family"
make.names("sex")
## [1] "sex"
make.names("HDL cholesterol")
## [1] "HDL.cholesterol"

And we construct or final formula with the outputs.

Non.accidental.death.in.close.genetic.family ~ sex + HDL.cholesterol

3.2.5 Performing the GWAS

We now proceed to perform the GWAS analysis.

results <- ds.GWAS(genoData = c("gds.Data19", "gds.Data21"), model = Non.accidental.death.in.close.genetic.family ~ sex + HDL.cholesterol)

And we can see the results for each cohort.

Cohort 1

Cohort 2

3.2.6 GWAS Plots

We can display the results of the GWAS using a Manhattan plot for each cohort.

Cohort 1

manhattan(results[[1]])

Cohort 2

manhattan(results[[2]])

datashield.logout(conns)

3.2.7 Meta-analysis

Once all the results of the GWAS are gathered, a meta-analysis can be performed. Each researcher may have their own pipelines and methodologies to do so, one option would be to use the metafor package.

fer una explicacio semblant a la de single cohort, dir una mica com sha fet aquest case study (nomes 10k snps i la divisio de individus per simular dos cohorts diferents) despres ficar mes o menys el mateix codi que a single cohort. el plot del manhattan shan de treure els numeros! ara ia estan com a per defecte ala funcion dsOmicsClient::manhattan (te mes sentit aixi)

3.3 Polygenic risk scores

3.3.1 Definition

By checking for specific variants and quantifying them, a researcher can extract the polygenic risk score (PRS) of an individual, which translates to an associated risk of the individual versus a concrete phenotype. A definition of the term reads

“A weighted sum of the number of risk alleles carried by an individual, where the risk alleles and their weights are defined by the loci and their measured effects as detected by genome wide association studies.” (Extracted from Torkamani, Wineinger, and Topol (2018))

The use of PRS markers is very popular nowadays and it has been proved to be a good tool to asses associated risks Escott-Price et al. (2017), Forgetta et al. (2020), Kramer et al. (2020).

3.3.2 Study configuration

When calculating the PRS, there is only one configuration possible, single-cohort. There is no need to perform analysis of multi cohorts at the same time, as the aggregated results correspond to each individual, and the individuals are unique for every cohort. This kind of analysis guarantees GDPR data confidentiality. Therefore the infrastructure schematic is almost identical as the GWAS single-cohort analysis.

Proposed infrastructure to perform PRS studies.

(#fig:prs_cohort_image)Proposed infrastructure to perform PRS studies.

The only difference is that we only need the Genotype information to calculate the PRS, so there is no pheotypes table involved. The resources used by the PRS functionality are the same VCF used for the GWAS.

3.3.3 Source of polygenic scores

There are two ways of stating the SNPs of interest and their weights in order to calculate the PRS.

  • Providing a ROI (region of interest) table. The ROI table has to have a defined structure and column names in order to be understood by this function. This table can be formulated using two schemas:

    • Scheme 1: Provide SNP positions. Use the column names: “chr_name,” “chr_position,” “effect_allele,” “reference_allele,” “effect_weight
    • Scheme 2: Provide SNP id’s. Use the column names: “rsID,” “effect_allele,” “reference_allele,” “effect_weight.”

The effect_weight have to be the betas (log(OR)).

  • Provide a PGS Catalog ‘Polygenic Score ID & Name.’ If this option is in use, the SNPs and beta weights will be downloaded from the PGS Catalog and will be used to calculate the PRS.

Please note when using a custom ROI table that it is much faster to use the Scheme 1, as the subset of the VCF files is miles faster to do using chromosome name and positions rather than SNP id’s.

For the use case illustrated on this vignette, the PGS Catalog will be used.

3.3.4 Connection to the Opal server

We have to create an Opal connection object to the cohort server. We do that using the following DSI functions.

# require('DSI')
# require('DSOpal')
# require('dsBaseClient')
# require('dsOmicsClient')
# 
# builder <- DSI::newDSLoginBuilder()
# builder$append(server = "cohort1", url = "https://opal-demo.obiba.org/",
#                user =  "dsuser", password = "password",
#                driver = "OpalDriver", profile = "omics")
# logindata <- builder$build()
# conns <- DSI::datashield.login(logins = logindata)

3.3.5 Assign the VCF resources

Now that we have created a connection object to the Opal, we have triggered a new R session on the server. Once this R session is active, we can begin assigning resources to it.

# DSI::datashield.assign.resource(conns, "chr19", "GWAS.chr19A")
# DSI::datashield.assign.resource(conns, "chr21", "GWAS.chr21A")

This resources that we have assigned are pointers to the VCF files that we want to use. There is one crucial difference between this analysis and the GWAS, here we don’t need to resolve the resources. This is for memory efficiency purposes. When we calculate a PRS, we are only interested in a small selection of SNPs, for that reason the (potentially) huge VCF files will be trimmed before being resolved, so only the relevant SNPs are in the memory. For that reason the resolving of the resources is delegated to the ds.PRS function, which will perform this trimming according to the SNPs of interest yielded by the PGS catalog query.

3.3.6 Calculate the PRS

For this example we will calculate the PRS associated to the trait ‘Coronary artery disease,’ which corresponds to the PGS ID PGS000011.

# ds.PRS(list(chr19 = "chr19", chr21 = "chr21"), pgs_id = "PGS000011")

On this example the chromosomes are separated into different VCF resource, having a big VCF file with all the variants would also work (illustrated in the next chunk).

## Do not run
# ds.PRS(list(whole_vcf = "whole_vcf_resource_name"), pgs_id = "PGS000011")
datashield.logout(conns)

aqui no fa falta especificar si single o multicohort pk no sa de fer meta analisis, cada individu te el seu outcome i aspavil! explicar aixo una mica per sobre, ficar un diagrama de draw.io de com funcione la funcio per dins, dir que ara mateix nomes es fa servir el PGS catalog pero que es pot ampliar al futur per tu passar la teua puta llista de snps i riscos per tal que funcioni ez gg, o potser feru directament i mostrar les dos opcions, casi que millor feru aixi sisis.

3.3.7 Internal structure of the PRS functionality

The PRS function can be divided into three main blocks that perform the required tasks to come up with the results. The three functional blocks are the following:

  • Client: Assemble the ROI table if PGS Catalog is used, perform calls to the server to subset the resources and calculate the PRS with the subsetted resources.
  • Server resolver: Resolve the selected resources (VCF files) and subset them using the ROI.
  • Server analysis: Use the resolved resources (small VCF files with only the SNPs of interest according to the ROI table) to calculate the PRS.

In order to introduce a little how those three blocks work internally, schematized flow charts have been designed. To understand the exact inner working of the functionality it is advised to follow the flowcharts alongside the actual source code. Find the links to the three blocks source code, client, server resolver and server analysis.

3.3.7.1 Client

The client performs two basic tasks. The first one is to select between a user introduced ROI or to retrieve the ROI from the PGS Catalog. Please note that ROI stands for Region Of Interest, which may not be the common term for PGS analysis but it makes sense since it refers to a table that contain the SNP’s of interest and their weight to the score. If the user introduces a custom ROI and a PGS Catalog ID, only the introduced table will be used, discarding the PGS Catalog ID information. Once this table is assembled, the next step is to call the required functions on the study servers. First, the resource resolver is called, and after that, the function that calculates the PRS is called.

This is illustrated on the following figure.

Flow chart of the client block.

(#fig:block1_image)Flow chart of the client block.

3.3.7.2 Server resolver

The server resolver is in charge of resolving the resources. The interesting aspect is that only a region of interest is actually assigned to the R session, this is to avoid overloading the memory with unnecessary information. There are two different methodologies to perform this subsetting, one is using chromosome names and position and the other is using the SNP id’s. Due to the technical specification of VCF files, is much easier to perform the subsetting using chromosome names and positions because there is an indexing file for the VCF files to perform fast queries of regions by position. On the other hand, to filter a VCF file using SNP id’s, the whole file has to be scanned, yielding a much slower process.

This block is illustrated on the following figure.

Flow chart of the server resolver block.

(#fig:block2_image)Flow chart of the server resolver block.

3.3.7.3 Server analysis

Many processes are performed inside the analysis block. For that reason, more than a flow chart, a step by step guide has been illustrated with the objects that are created (or modified) on each step. The most important step on this block is making sure that the alternate (risk) alleles match between the VCF resources and the alternate alleles stated by the ROI or PGS Catalog.

The block is illustrated on the following figure.

Flow chart of the server analysis block.

(#fig:block3_image)Flow chart of the server analysis block.

References

Escott-Price, Valentina, Amanda J Myers, Matt Huentelman, and John Hardy. 2017. “Polygenic Risk Score Analysis of Pathologically Confirmed Alzheimer Disease.” Annals of Neurology 82 (2): 311–14.
Forgetta, Vincenzo, Julyan Keller-Baruch, Marie Forest, Audrey Durand, Sahir Bhatnagar, John P Kemp, Maria Nethander, et al. 2020. “Development of a Polygenic Risk Score to Improve Screening for Fracture Risk: A Genetic Risk Prediction Study.” PLoS Medicine 17 (7): e1003152.
Kramer, Iris, Maartje J Hooning, Nasim Mavaddat, Michael Hauptmann, Renske Keeman, Ewout W Steyerberg, Daniele Giardiello, et al. 2020. “Breast Cancer Polygenic Risk Score and Contralateral Breast Cancer Risk.” The American Journal of Human Genetics 107 (5): 837–48.
Torkamani, Ali, Nathan E Wineinger, and Eric J Topol. 2018. “The Personal and Clinical Utility of Polygenic Risk Scores.” Nature Reviews Genetics 19 (9): 581–90.